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Sufficient dimension reduction (SDR) in regression, which reduces 
the dimension by replacing original predictors with a minimal set of 
their linear combinations without loss of information, is very helpful 
when the number of predictors is large. The standard SDR methods 
suffer because the estimated linear combinations usually consist of 
all original predictors, making it difficult to interpret. In this paper, 
we propose a unified method — coordinate-independent sparse esti- 
mation (CISE) — that can simultaneously achieve sparse sufficient di- 
mension reduction and screen out irrelevant and redundant variables 
efficiently. CISE is subspace oriented in the sense that it incorporates 
a coordinate-independent penalty term with a broad series of model- 
based and model-free SDR approaches. This results in a Grassmann 
manifold optimization problem and a fast algorithm is suggested. Un- 
der mild conditions, based on manifold theories and techniques, it can 
be shown that CISE would perform asymptotically as well as if the 
true irrelevant predictors were known, which is referred to as the ora- 
cle property. Simulation studies and a real-data example demonstrate 
the effectiveness and efficiency of the proposed approach. 

1. Introduction. Consider the regression of a univariate response y on p 
random predictors x = (x\, . . . , x p ) T € MP, with the general goal of inferring 
about the conditional distribution of y|x. When p is large, most statistical 
methods face the "curse of dimensionality," and thus dimension reduction 
is desirable. 

Sufficient dimension reduction (SDR) introduced by Cook (1994, 1998a) is 
important in both theory and practice. It strives to reduce the dimension of 
x by replacing it with a minimal set of linear combinations of x, without loss 
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of information on the conditional distribution of y|x. If a predictor subspace 
5cr satisfies 

y ±x|P 5 x, 

where X stands for independence and Pr.\ represents the projection matrix 
with respect to the standard inner product, then S is called a dimension 
reduction space. The central subspace S y i x , which is the intersection of all 
dimension reduction spaces, is an essential concept of SDR. Under mild 
conditions, it can be shown that S y \ x is itself a dimension reduction subspace 
[Cook (1994, 1998a)], which we assume throughout this article, and then it is 
taken as the parameter of interest. The dimension d of S y \ x , usually far less 
than p, is assumed to be known in this article. We also assume throughout 
that n> p. 

There has been considerable interest in dimension reduction methods 
since the introduction of sliced inverse regression [SIR; Li (1991)] and sliced 
average variance estimation [SAVE; Cook and Weisberg (1991)]. Li (1992) 
and Cook (1998b) proposed and studied the method of principal Hessian 
directions (PHD), and the related method of iterative Hessian transforma- 
tions was proposed by Cook and Li (2002). Chiaromonte, Cook and Li (2002) 
proposed partial sliced inverse regression for estimating a partial central sub- 
space. Yin and Cook (2002) introduced a covariance method for estimating 
the central kth moment subspace. Most of these and many other dimension 
reduction methods are based on the first two conditional moments and as 
a class are called F2M methods [Cook and Forzani (2009)]. They provide 
exhaustive estimation of S y \ x under mild conditions. Recently, Li and Wang 
(2007) proposed another F2M method called directional regression (DR). 
They argued that DR is more accurate than or competitive with all of the 
previous F2M dimension reduction proposals. In contrast to these and other 
moment-based SDR approaches, Cook (2007) introduced a likelihood-based 
paradigm for SDR that requires a model for the inverse regression of x on y. 
This paradigm, which is broadly referred to as principal fitted components 
(PFC), was developed further by Cook and Forzani (2009). Likelihood-based 
SDR inherits properties and methods from general likelihood theory and can 
be very efficient in estimating the central subspace. 

All of the aforementioned dimension reduction methods suffer because 
the estimated linear reductions usually involve all of the original predictors 
x. As a consequence, the results can be hard to interpret, the important 
variables may be difficult to identify and the efficiency gain may be less 
than that possible with variable selection. These limitations can be over- 
come by screening irrelevant and redundant predictors while still estimating 
a few linear combinations of the active predictors. Some attempts have been 
made to address this problem in dimension reduction generally and SDR 
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in particular. For example, Li, Cook and Nachtsheim (2005) proposed a 
model-free variable selection method based on SDR. Zou, Hastie and Tib- 
shirani (2006) proposed a sparse principal component analysis. Ni, Cook and 
Tsai (2005) introduced a shrinkage version of SIR, while Li and Nachtsheim 
(2006) suggested a sparse version of SIR. Li (2007) studied sparse SDR by 
adapting the approach of Zou, Hastie and Tibshirani (2006). Zhou and He 
(2008) proposed a constrained canonical correlation procedure (C 3 ) based 
on imposing the Li-norm constraint on the effective dimension reduction 
estimates in CANCOR [Fung et al. (2002)], followed by a simple variable 
filtering method. Their procedure is attractive because they showed that it 
has the oracle property [Donoho and Johnstone (1994), Fan and Li (2001)]. 
More recently, Leng and Wang (2009) proposed a general adaptive sparse 
principal component analysis and Johnstone and Lu (2009) studied the large 
p theory in sparse principal components analysis. 

However, most existing sparse dimension reduction methods are con- 
ducted stepwise, estimating a sparse solution for a basis matrix of the 
central subspace column by column. Instead, in this article, we propose 
a unified one-step approach to reduce the number of variables appearing in 
the estimate of S y \ x . Our approach, which hinges operationally on Grass- 
mann manifold optimization, is able to achieve dimension reduction and 
variable selection simultaneously. Additionally, our proposed method has 
the oracle property: under mild conditions the proposed estimator would 
perform asymptotically as well as if the true irrelevant predictors were 
known. 

We start in Section 2.1 by reviewing the link between many SDR meth- 
ods and a generalized eigenvalue problem disclosed by Li (2007). In Sec- 
tion 2.2, we describe a new SDR penalty function that is invariant under 
orthogonal transformations and targets the removal of row vectors from the 
basis matrix. Based on this penalty function, in Section 2.3, a coordinate- 
independent penalized procedure is proposed which enables us to incorporate 
many model-free and model-based SDR approaches into a simple and unified 
framework to implement variable selection within SDR. A fast algorithm, 
which combines a local quadratic approximation [Fan and Li (2001)] and 
an eigensystem analysis in each iteration step, is suggested in Section 2.4 
to handle our Grassmann manifold optimization problem with its nondif- 
ferentiable penalty function. In Section 2.5, we describe the oracle property 
of our estimator. Its proof differs significantly from those in the context 
of variable selection in single-index models [e.g., Fan and Li (2001), Zou 
(2006)] because the focus here is on subspaces rather than on coordinates. 
Results of simulation studies are reported in Section 3, and the Boston 
housing data, is analyzed in Section 4. Concluding remarks about the pro- 
posed method can be found in Section 5. Technical details are given in the 
Appendix. 
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Table 1 

The generalized eigenvalue formulations for principle component analysis (PC A), 
principle fitted component (PFC) models, sliced inverse regression (SIR), sliced average 
variance estimation (SAVE) and directional regression (DR) methods 



Method M N 



PCA E I p 

PFC E fl t E 

SIR Cov[£;{x-£;(x)|y}] E 

SAVE E 1/2 £[{I p -Cov(z|y)} 2 ]E 1/2 E 
DR ^ 2 {2E[E 2 (zz T \ y )] + 2E 2 {E(?,\y)E( 7 . T \y)] 

+ 2E[E{z,\y)E(7.\y)\E[E(z\y)E(z T \y)\ - 2l v }^' 2 E 



2. Theory and methodology. 

2.1. Motivation: Generalized eigenvalue problems revisited. Li (2007) 
showed that many moment based sufficient dimension reduction methods 
can be formulated as a generalized eigenvalue problem in the following form: 

(2.1) M n S ni = X ni N n S ni for i = l,...,p, 

where M n > is a method-specific symmetric kernel matrix, N n > is sym- 
metric, often taking the form of the sample covariance matrix S n of x; 
S n i, . . . , 8 np are eigenvectors such that <5^N n £ n j = 1 if i = j and if % ^ j 
and A n i > • • • > X np are the corresponding eigenvalues. We use the subscript 
"n" to indicate that S n , M n , N n and X n i are the sample versions of the 
corresponding population analogs M, N and Aj. Under certain conditions 
that are usually imposed only on the marginal distribution of x, the first 
d eigenvectors {8 n \ , . . . , 8 nc [}, which correspond to the nonzero eigenvalues 
A ? ii > • • • > X n d form a consistent estimator of a basis for the central sub- 
space. Letting z = Xl~ 1 / 2 {x — E(x)}. Many commonly used moment based 
SDR methods are listed in Table 1 with the population versions of M n 
and N n . 

Following Cook (2004), Li (2007) showed that the eigenvectors {S n \, . . . , 
S n d} from (2.1) can be obtained by minimizing a least square objective 
function. Let 

p 

(2.2) V = argmin^ HN" 1 !^ - VV T mi||^ n subject to V T N n V = I d , 

^ i=l 

1/2 

where rrij denotes the ith column of M n , % = 1, . . . ,p, V is a p x d matrix 
and the norm here is with respect to the N n inner product. Then Vj = d n j, 
j = 1, . . . ,d, where Yj stands for the jth column of V, so that span(V) 
is the estimator of the central subspace. To get a sparse solution, Li then 
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added penalties to the objective function in (2.2), leading to the optimization 
problem 

(a,V s )=min|^||N- 1 m i -aV T m l ||^ n +r 2 tr(V T N n V) + ^r 1 , j ||V j || 1 | 

subject to Q T N n Q; = Li, where tr(-) stands for the trace operator, || • || r 
denotes the L r norm, r 2 is some positive constant and t\j > for j = 1, . . . ,d 
are the lasso shrinkage parameters that need to be determined by some 
method like cross validation (CV). The solution V s is called the sparse 
sufficient dimension reduction estimator. As a result of the lasso constraint, 
V s is expected to have some elements shrunk to zero. 

We can see that Li's sparsity method is coordinate dependent because 
the L\ penalty term is not invariant under the orthogonal transformation of 
the basis and it forces individual elements of the basis matrix V s to zero. 
However, variable screening requires that entire rows of V s be zero, which is 
not the explicit goal of Li's method. To see this more clearly, partition x as 
(x^x^) 7 ", where xi corresponds to q elements of x and x 2 to the remaining 
elements. If 

(2.3) yXx 2 | Xl , 

then x 2 can be removed, as given xi, x 2 contains no further information 
about y. Let the p x d matrix r) be a basis for S y \ x and partition r\ = 
(*7i\ 7 72 , ) T m accordance with the partition of x. Then the condition (2.3) is 
equivalent to rj 2 = [Cook (2004)], so the corresponding rows of the basis 
are zero vector. 

In effect, Li's method is designed for element screening, not variable 
screening. Our experience reflects this limitation and reinforces the no- 
tion that V s may not be sufficiently effective at variable screening. In- 
spired by Li's method, we propose a new variable screening method — called 
coordinate-independent sparse estimation (CISE) — in the next subsection. 
We will show that CISE is simpler and more effective than Li's method at 
variable screening. 

CISE can be applied not only to moment based SDR approaches but also 
model based approaches. Cook (2007) and Cook and Forzani (2008) devel- 
oped several powerful model-based dimension reduction approaches, collec- 
tively referred to as principal fitted components (PFC). PFC-based SDR 
methods can also be formulated in the same way as (2.1), as summarized in 
the next proposition. In preparation, consider the following model for the 
conditional distribution of x given y, 

(2.4) x = /x + r^f(y) + A 1 / 2 e, 
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where /iGFisa location vector, T G W xd , T 1 T = l d , £ £ R dxr with rank 
d, f € W is a known vector-valued function of y, A = Var(x|y) > 0, and 
e S MP is assumed to be independent of y and normally distributed with 
mean and identity covariance matrix. 

Proposition 1 . Suppose the conditional distribution of x given y can 
be described by (2.4)- Then the maximum likelihood estimator (MLE) of S y \ x 
can be obtained through the generalized eigenvalue problem of the form (2.1) 
with M n = Sgt and N n = S n , where Sfit * s the sample covariance matrix 
of the fitted vectors from the linear regression of x on f . 

A commonly used case in the PFC models is A = cr 2 I p for a > 0, in which 

the MLE of S y ^ can be obtained through (2.1) with M n = Sfl t and N n = Ip- 
The covariates f (y) in model (2.4) usually take form of polynomial, piecewise 
or Fourier basis functions. Thus, the PFC models can effectively deal with 
the nonlinear relationship between the predictors and the response. 

2.2. A coordinate-independent penalty function. Let V = (vi, . . . , v p ) T 
denote apxd matrix with rows vf, i = 1, . . . ,p. In this section, we introduce 
a coordinate-independent penalty, depending only on the subspace spanned 
by the columns of V. Let q« be the vector in MP with the ith component 
one, else zero. 

We define a general coordinate-independent penalty function as 

i 

where 6% > serve as penalty parameters, and hi are positive convex func- 
tions defined in R d . To achieve variable screening, the functions hi must be 
nondifferentiable at the zero vector. It is clear that the function <j) is indepen- 
dent of the basis used to represent the span of V, since for any orthogonal 
matrix O, </>(V) = 0(VO). In fact, any penalty function defined on VV T 
meets our requirement. 

Given h\ = ■ ■ ■ = h p = y^J, we have a special coordinate-independent 
penalty function: 

v 

(2-5) p(V) = ^|H| 2 . 

i=l 

A method for selecting the tuning parameters will be discussed in Section 2.6. 
We can see that the penalty function p has the same form as the group 
lasso proposed by Yuan and Lin (2006) but their concepts and usages are 
essentially different. Through this article, we shall use only p in application 
and theory to demonstrate our ideas. 
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Penalty (2.5) is appealing for variable selection because it is independent 
of the basis used to represent the span of V, p(V) = p(VO) for any orthog- 
onal matrix O, and because it groups the row vector coefficients of V. This 
motivated us to consider the regularized function (2.5) that can shrink the 
corresponding row vectors of irrelevant variables to zero. Another appeal- 
ing feature of using this penalty is its oracle property, which is discussed in 
Section 2.5. 

2.3. Coordinate-independent sparse estimation. Recall the generalized 
eigenvalue problem (2.1) and the associated notation. Formally, 

v 

]T HN-V - VV T mi \\ 2 Nn = tr(G n ) - tr(V T M n V), 

i=X 

j/2 — 1/2 

where G n = N n M n N n and we use G to denote its population analog 
in what follows. Hence, the ordinary sufficient dimension reduction estima- 
tion (OSDRE) given in (2.2) is 

(2.6) V = arg min - tr( V T M n V) subject to V T N n V = I d . 

v 

By using the coordinate independent penalty function given in last sub- 
section, we propose the following coordinate-independent sparse sufficient 
dimension reduction estimator (CISE): 

(2.7) V = argmin{- tr(V T M n V) + p(V)} subject to V T N n V = I d , 

v 

where p(V) is defined in (2.5). 

The solution V is not unique as VO is also a solution for any orthogonal 
matrix O. In a strict sense, we are minimizing (2.7) over the span of the 
columns of V. Thus, V denotes any basis of the solution of (2.7). Analo- 
gously, the solution V is one basis of the solution of (2.6). Before proceeding, 
we rewrite (2.6) and (2.7) as equivalent unitary constrained optimization 
problems which will facilitate our exposition. We summarize the result into 
the following proposition without giving its proof since it follows from some 
straightforward algebra. 

— 1 /2 ^ 

Proposition 2. The minimizer (2.6) is equivalent to V = N n T where 

(2.8) f = arg min - tr(r T G n T) subject to T T T = I d . 

r 

Furthermore, G n F = TA n i, where A n i = diag(A n i, . . . , X nd ). Correspond- 

~ —1/2 ~ 

ingly, the minimizer (2.7) is equivalent to V = N n T, where 

(2.9) f = argmin{- tr(r T G n r) + ^(N" 1 /^)} subject to T T T = I d . 

r 
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The minimization of (2.8) and (2.9) is a Grassmann manifold optimiza- 
tion problem. A Grassmann manifold, which is defined as the set of all 
c?-dimensional subspaces in M. p , is the natural parameter space for the T 
parametrization in (2.8). For more background on Grassmann manifold opti- 
mization, see Edelman, Arias and Smith (1998). The traditional Grassmann 
manifold optimization techniques cannot be applied directly to (2.9) due to 
the nondifferentiability of p(-). Nevertheless, we have devised a simple and 
fast algorithm to solve (2.9), as discussed in the next subsection. 

2.4. Algorithm. To overcome the nondifferentiability of p(-), we adopt 
the local quadratic approximation of Fan and Li (2001); that is, we approx- 
imate the penalty function locally with a quadratic function at every step 
of the iteration as follows. 

Let V(°) = (vS 0) ,...,vJ, 0) ) T = N~ 1/2 f(°) be the starting value. The un- 
constrained first derivative of p(V) with respect to the p x d matrix V is 
given by 

dp ( 0i 0i e p v 

^tt = drag - — — V. 
<?V Vll v l||2 ||Vj|| 2 ||v p ||2/ 

Following Fan and Li, the first derivative of p(V) around V^ ^ can be 
approximated by 

w diag W r ' W 

Ml v l II 52 IK P IK, ; || 2 / 

By using the second-order Taylor expansion and some algebraic manipula- 
tion, we have 

p(V) « i tr(V T HW V) + Co = \ trC^N-^H^N-^r) + C , 

where Co stands for a constant with respect to V. 
Then find f « by minimizing: 

-tr(r T G n T) + itr^N-^nWN-^r) 

= tr{r r (-G n + iN ( ; 1 /2 H (o) N -i/2 )r} . 

This minimization problem can be easily solved by the eigensystem analysis 
of the matrix G n — 2 _1 N n 1 ^ 2 H(°)N n 1 ^ 2 , that is, the columns of f W are the 
first d principal component directions of G n — 2~ 1 N n 1 ^ 2 H^°^N n 1 ^ 2 . Next, let 
V^ 1 ) = 'N n 1 ^ 2 T ^ and start the second round of approximation of p(V). The 

(k) a 

procedures repeat until it converges. During the iterations, if ||v: H2 ~ 0, say 

Il v ^||2 < e where e is a prespecified small positive number (e.g., e = 10~ 6 ), 
then the variable x; is removed. 
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With respect to the choice of the initial values f(°), a simple but effective 
solution is to use f (°) = T, the minimizer of (2.8). With T as the initial 
values, we found that the frequency of nonconvergence is negligible in all of 
our simulation studies and the convergence is quite fast, usually requiring a 
few dozen iterations. A Matlab interface was used to implement this CISE 
algorithm. The programs can be obtained from the first author upon request. 

2.5. Oracle property. In what follows, without loss of generality, we as- 
sume that only the first q predictors are relevant to the regression, where 
d<q<p. Given apxd matrix K, K( 9 ) and K( p _ g ) indicate the sub-matrices 
consisting of its first q and remaining p — q rows. If K is p x p, then the nota- 
tion indicates its first q and the last p — q block sub- matrices. In the context 
of the single-index model, Fan and Li (2001) and Zou (2006) have shown 
that, with the proper choice of the penalty functions and regularization pa- 
rameters, the penalized likelihood estimators have the oracle property. With 
continuous penalty functions, the coefficient estimates that correspond to 
insignificant predictors must shrink toward as the penalty parameter in- 
creases, and these estimates will be exactly if that parameter is sufficiently 
large. In this section, we present theorems which establish the oracle prop- 
erty of CISE. 

Let a n = m&x{9j,j < q} and b n = min{9j, j > q}, where the 6j's are the 
penalty parameters defined in Section 2.2, let Ai > • • • > A p > denote the 
eigenvalues of G, and define the matrix norm ||V|| S = y tr(V T V). We also 
require a metric D in the set of all subspaces of M p . 

Definition 1. The distance between the subspaces spanned by the 
columns of V n and V, denoted as D(V n ,V), is defined as the square root 
of the largest eigenvalue of 

(JV n -iV) T (iV„-iV). 

This distance criterion was first used by Li, Zha and Chiaromonte (2005) 
in the sufficient dimension reduction setting. See Gohberg, Lancaster and 
Rodman (2006) for more details. We use the following assumptions to es- 
tablish the oracle property. 

Assumption 1. Let Vo denote the minimizer of (2.6) when the popula- 
tion matrices M and N are used in place of M n and N n . Then Vo( p _ g ) = 0. 

Assumption 2. M n = M + C^n" 1 / 2 ) and N n = N + O p {n~ 1 / 2 ). 

Given some mild method-specified conditions, the minimizer of (2.6) V 
is a consistent estimator of a basis of the central subspace. For example, 
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SIR provides the consistent estimate of the central subspace given that the 
linearity and coverage conditions hold [Cook (1998a), Chiaromonte, Cook 
and Li (2002)]. Consequently, the population version Vo will be a basis of 
the central subspace. Therefore, Assumption 1 is a reasonable one which 
facilitates our following presentations. Assumption 2 is mild and typically 
holds. These two assumptions suffice for our main results. 

We state our theorems here, but their proofs are relegated to the Appendix. 
The constrained objective function in the minimization problem (2.7) is de- 
noted as Q(V; M n ) := /(V; M n ) + p(V) where /(V; M n ) = - tr(V T M„V). 
The first theorem establishes existence of CISE. 

Theorem 1. If Assumptions 1 and 2 hold, Xd > Xd+i and y/na n A ; 
then there exists a local minimizer V n ofQ(V; M n ) subject to V r N n V = Id, 
so that 

D(V n ,V ) = O p (n- 1 / 2 ). 

It is clear from Theorem 1 that by choosing the 0j's properly, there exists 
a root-n consistent CISE. The next transition theorem states an oracle-like 
property of CISE. 

Theorem 2. If Assumptions 1 and 2 hold, Xd > Xd+i, \fna n A and 
\Jnb n A oo, then the root-n consistent local minimizer V n in Theorem 1 
must satisfy: 

(i) Pr(V n(r<?) =0)^l, 

(ii) \fnD(V n r q \,'V n (o)) = o p (l), where V n( - ) is the minimizer of Q(V; 
M n ( 9) ) subject to V T N n( g)V = Li. 

Theorem 2(i) states that with probability tending to 1, all of the zero 
row of Vo must be estimated as 0. Theorem 2(ii) tells us that there ex- 
ist a local minimizer V n so that the difference between its nonzero sub- 
matrix V n (g) and V n (p) is of order o p (n~ 1 / 2 ). That is to say, we have 

the result that y/nD(V n ^, V ( 9 )) has the same asymptotic distribution as 

■\/nD{y n rQ\, V ( g )). With respect to the asymptotic distribution of V n ( ), 
there seems to be no general result in the literature because different spec- 
ifications on M n ( 9 ) and N n r q \ yield different asymptotic distributions. This 
is not of great interest here and we refer to Zhu and Ng (1995), Li and Zhu 
(2007) and the references therein. 

The second part of Theorem 2 is actually valid in a generalized sense. The 
OSDRE in the exact oracle property, denoted as ~V n ro) , is obtained by using 
the q x q M n and N n formed with the first q variables (denoted as M n (Q) and 
N n (0))- Usually, N n ( ) = N n ( 3 ). From the definition, it is straightforward to 
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see that M n (Q) = M n ( g ) for the PCA, SIR and PFC methods. Thus, in these 
cases, Theorem 2 establishes the exact oracle property. We conjecture that 
M n (Q) should be very close to M n ( g ) for any SDR method that satisfies 
Assumptions 1 and 2. From the proof of Theorem 2(ii), we can conclude 
that if 

(2-10) \\M n{0) -M n{q) \\ s = O p (a n ), 

the exact oracle property still holds. The next result establishes that the 
condition above holds for DR and SAVE under certain conditions. 

Proposition 3. Suppose the linearity and constant variance conditions 
[Li and Wang (2007)] hold and (nan) -1 = O p (l). Then condition (2.10) is 
satisfied for the DR and SA VE methods. 

By this proposition, Theorem 2 and the discussion above, we know that 
from asymptotic viewpoints CISE is effective for all of the commonly used 
SDR methods. We summarize this major result in the following theorem. 

Theorem 3. Assume that the conditions in Theorem 2 and Propo- 
sition 3 hold. Then the exact oracle property is achieved for the PCA, 
SIR, PFC, SAVE and DR methods. That is, V n has the selection con- 
sistency and v / nD(V n ( IJ ),Vo( g )) has the same asymptotic distribution as 
VnD{Y n[0) ,\ m ). 

In this paper, we make no attempt to further analysis general conditions 
for the validity of (2.10), but we think that such studies certainly warrant 
future research. 

2.6. Choice of tuning parameters. We recommend using 
(2.11) e i = 9\\v i \\^ r , 

where Vj is the ith row vector of the OSDRE V defined in (2.6), and r > is 
some pre-specified parameter. Following the suggestions of Zou (2006), r = 
0.5 is used in both the simulation study and the illustration in Section 4. Such 
a strategy effectively transforms the original p-dimensional tuning parameter 
selection problem into a univariate one. By Lemma 2 in the Appendix, Vj 
is root-n consistent. Thus, it is easily to verify that the tuning parameter 
defined in (2.11) satisfies the conditions on a n and b n needed by Theorem 2 as 
long as y/n0 — > and Hence, it suffices to select 9 £ [0, +oo) 

only. 

To choose the tuning parameter 9, we use the following criterion which 
has a form similar to ones used by Li (2007) and Leng and Wang (2009): 

-tr(VjM n V e ) + 7 -df e , 
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where Vg denotes the solution for V given 9, dig denotes the effective num- 
ber of parameters, and 7 = 2/n for AlC-type and 7 = log(n)/n for BIC- 
type criteria. Following the discussion of Li (2007), we estimate dig by 
(pg — d) ■ d where pg denotes the number of nonzero rows of Vg because we 
need (pg — d)-d parameters to describe a (i-dimensional Grassmann manifold 
in R P( > [Edelman, Arias and Smith (1998)]. 

3. Simulation studies. We report the results of four simulation studies in 
this section, three of which were conducted using forward regression models 
and one was conducted using an inverse regression model. We compared our 
method with the C 3 method [Zhou and He (2008)] and the SSIR method [Ni, 
Cook and Tsai (2005)]. BIC and RIC [Shi and Tsai (2002)] were used in SSIR 
to select the tuning parameters, and two a levels (0.01 and 0.005) were used 
in the C 3 method. We used SIR and PFC to generate M„ and N n for CISE 
selection. For these methods, denoted CIS-SIR and CIS-PFC, we report only 
the results using the BIC criterion to select tuning parameters as we tend 
to believe that BIC has consistency property. Unreported simulations using 
the RIC criterion show slightly better performance in some cases though. 

In each study, we generated 2500 datasets with the sample size n = 60 
and n = 120. For the C 3 method, the quadratic spline with four internal 
knots was used, as suggested by Zhou and He (2008). Six slices were used 
for the SSIR method. We calculated M n in the PFC model setting using 
f(u) = {\y\iV->y 2 ) T f° r au simulation studies. 

We used three summary statistics — 77 , r 2 and — to assess how well the 
methods select variables: 77 is the average fraction of nonzero rows of V 
associated with relevant predictors; r 2 is the average fraction of zero rows 
of V associated with irrelevant predictors; and is the fraction of runs 
in which the methods select both relevant and irrelevant predictors exactly 
right. 

Study 1. 

y = 27 + x 2 + x 3 + 0.5e, 

where e~iV(0,l), x = (27, . . . , x 2A ) T ~ N(0, S) with S y = 0.5^1 for 1 < 
i,j < 24, and x and e are independent. In this study, the central subspace 
is spanned by the direction (3 1 = (1, 1, 1, 0, . . . , 0) T with twenty-one zero co- 
efficients. 

Study 2. 

y = xi+x 2 + X3 + 2e, 

where e~iV(0,l), x = (37, . . . , x 24 ) T ~ N(0, S) with = 0.5^1 for 1 < 
i,j < 24, and x and e are independent. In this study, the central subspace is 
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spanned by the direction /3 1 = (1,1,1,0,..., 0) T with twenty-one zero coef- 
ficients. In short, this study was identical to the first, except the error was 
increased by a factor of 4. 

Study 3. 

y = xi/{0.5 + (x 2 + 1.5) 2 } + 0.2e, 

where e~JV(0,l), x = (x u . . . , x 2 <i) T ~ N(0, E) with E„ = 0.5^ for 1< 
i,j < 24, and rr and e are independent. In this study, the central subspace is 
spanned by the directions j3 l = (1, 0, . . . , 0) T and (3 2 = (0, 1, . . . , 0) T . 

Study 4. 

x = r( y ,y 2 f + A 1 /2 e , 

where e ~ JV(0,I 2 4), V ~ JV(0, 1), Ay = 0.5^1 for 1 < i,j < 24, and y and 
e are independent. The first column of Y is (0.5,0.5,0.5,0.5,0,... ,0) T and 
the second column of Y is (0.5, —0.5, 0.5, —0.5, 0, . . . , 0) T . In this study, the 
central subspace is the column space of A _1 T. 

The simulation results from these four studies are summarized in Ta- 
bles 2-5, respectively. The standard errors of the rj-'s, yjr}~{l — r^)/50, are 
typically less than 0.01 throughout this section. In Study 1, the signal-to- 
noise ratio is close to 5 (the ratio of the stand deviation of 2:1+2:2 + X3 
to 0.5). Because of the large signal-to-noise ratio, all the considered meth- 
ods show very good performance, but CIS-SIR, CIS-PFC and C 3 perform 
slightly better than SSIR. In Study 2, we decreased the signal-to-noise ra- 
tio to about 1.2 and now CIS-SIR and CIS-PFC perform much better than 
C 3 and SSIR. In both Studies 3 and 4, CISE is generally superior to the 
other two methods, especially for CIS-PFC and the rate r$. It should be 
pointed out that the superiority of CISE becomes more significant when 
n gets larger. When n = 120, C 3 still cannot perform exact identifications 
well, while SSIR rarely identifies all relevant and irrelevant variables cor- 
rectly. 

While both CISE and C 3 have the oracle property, they differ in many 
aspects. CISE is a unified method that can be applied to many popular 
sufficient dimension reduction methods, including PCA, PFC, SIR, SAVE 
and DR. On the other hand, C 3 is based on one specified sufficient dimension 
reduction method, canonical correlation [Fung et al. (2002)]. We regard r$, 
the estimated probability all relevant and irrelevant variables are identified 
correctly, as the most important aspect of a method. On that measure CISE 
typically dominates C 3 . There was only one case (Table 1, n = 60) in which 
C 3 did slightly better than CISE. Additionally, CISE seems conceptually 
simpler and is easily implemented. 
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Table 2 
Summary of Study 1 



Method: 


CIS-SIR 


CIS-PFC 




c 3 




SSIR 


Criterion: 


BIC 


BIC 


a = 0.01 


a 


= 0.005 


BIC 


RIC 


Sample size 






n = 


60 








ri 


0.991 


1.000 


1.000 




1.000 


0.993 


0.974 


r-i 


0.999 


1.000 


0.999 




0.999 


0.997 


0.999 


r:>, 


0.970 


1.000 


0.978 




0.991 


0.939 


0.914 


Sample size 






n = 


120 








ri 


1.000 


1.000 


1.000 




1.000 


1.000 


1.000 


ri 


1.000 


1.000 


1.000 




1.000 


0.999 


1.000 


rz 


1.000 


1.000 


1.000 




1.000 


0.994 


1.000 



Table 3 
Summary of Study 2 



Method: 


CIS-SIR 


CIS-PFC 




C 3 




SSIR 


Criterion: 


BIC 


BIC 


a = 0.01 


a 


= 0.005 


BIC 


RIC 


Sample size 






n = 


60 








ri 


0.713 


0.795 


0.583 




0.565 


0.770 


0.706 


'-12 


0.988 


0.992 


0.998 




0.998 


0.881 


0.939 




0.233 


0.399 


0.075 




0.080 


0.058 


0.104 


Sample size 






n = 


120 








ri 


0.909 


0.951 


0.669 




0.615 


0.973 


0.930 


ri 


0.998 


0.998 


1.000 




1.000 


0.928 


0.981 


r,', 


0.694 


0.827 


0.209 




0.131 


0.244 


0.554 



Table 4 
Summary of Study 3 



Method: 


CIS-SIR 


CIS-PFC 




C 3 




SSIR 


Criterion: 


BIC 


BIC 


a = 0.01 


a 


= 0.005 


BIC 


RIC 


Sample size 






n = 


60 








n 


0.789 


0.906 


0.770 




0.742 


0.934 


0.888 


1-2 


0.965 


0.979 


0.948 




0.955 


0.633 


0.828 




0.344 


0.588 


0.229 




0.226 


0.000 


0.004 


Sample size 






n = 


120 








/'i 


0.948 


0.995 


0.839 




0.781 


0.994 


0.983 


r> 


0.992 


0.998 


0.956 




0.963 


0.664 


0.865 


r-s 


0.838 


0.973 


0.309 




0.245 


0.001 


0.027 
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Table 5 
Summary of Study 4 



Method: 


CIS-SIR 


CIS-PFC 




C 3 




SSIR 


Criterion: 


BIC 


BIC 


a = 0.01 


a 


= 0.005 


BIC 


RIC 


Sample size 






n = 


60 








ri 


0.676 


0.817 


0.670 




0.643 


0.871 


0.776 


r-i 


0.968 


0.989 


0.956 




0.958 


0.641 


0.832 


r:>, 


0.069 


0.327 


0.022 




0.029 


0.000 


0.000 


Sample size 






n = 


120 








ri 


0.805 


0.928 


0.828 




0.809 


0.988 


0.964 


ri 


0.993 


0.998 


0.967 




0.969 


0.696 


0.890 


rz 


0.299 


0.687 


0.147 




0.178 


0.000 


0.000 



4. Boston housing data. 

4.1. Variable screening. We applied our method to the Boston housing 
data, which has been widely studied in the literature. The Boston housing 
data contains 506 observations, and can be downloaded from the web site 
http://lib.stat.cmu.edu/datasets/boston_corrected.txt. The re- 
sponse variable y is the median value of owner-occupied homes in each 
of the 506 census tracts in the Boston Standard Metropolitan Statistical 
Areas. The 13 predictor variables are per capita crime rate by town (xi); 
proportion of residential land zoned for lots over 25,000 sq.ft (X2); propor- 
tion of nonretail business acres per town (^3); Charles River dummy variable 
(#4); nitric oxides concentration (#5); average number of rooms per dwelling 
(xq); proportion of owner-occupied units built prior to 1940 (#7); weighted 
distances to five Boston employment centers (x$); index of accessibility to 
radial highways (xg); full- value property-tax rate (x_o); pupil-teacher ratio 
by town (xn); proportion of blacks by town (#12); percentage of lower status 
of the population (£13). 

Previous studies suggested that we remove those observation with crime 
rate greater than 3.2, as a few predictors remain constant except for 3 ob- 
servations in this case [Li (1991)]. So we used the 374 observations with 
crime rate smaller than 3.2 in this analysis. All the methods considered in 
Section 3 were applied to this dataset. Scatter-plotting of each predictor 
against y, we concluded that it would be sufficient to use f = {y/y, y,y 2 ) T 
in the PFC model. Since PFC is a scale-invariant method, we did not stan- 
dardize the data as many other methods do. Similar to the previous studies 
in the literature, we pick up two directions to estimate the central subspace. 
The estimated bases of the central subspace for all the considered methods 
are summarized in Table 6. 
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Table 6 

Estimated bases of the central subspace in Boston housing data 



Method: 


CIS-SIR 


CIS-PFC 


C 3 


SSIR-BIC 


SSIR-RIC 


•('i 











-0.050 -0.131 


-0.041 -0.123 




-0.004 -0.047 








-0.001 -0.002 


-0.001 -0.001 


x 3 











0.001 0.005 





X4 











-0.033 0.020 
















0.719 -0.882 


0.543 -0.765 


xe 


-0.999 0.034 


-0.999 0.034 


0.962 -0.645 


-0.684 -0.448 


-0.834 -0.627 


X7 


-0.008 -0.139 


-0.003 -0.077 


-0.174 -0.096 


0.006 -0.001 


0.005 -0.001 


Xs 











0.082 -0.012 


0.060 -0.010 


■>■•) 











-0.019 0.035 


-0.016 0.033 


Xio 


-0.001 -0.01 


-0.002 -0.035 


-0.166 


0.001 -0.001 


0.001 -0.001 


Xu 


0.021 -0.361 


0.018 -0.280 


-0.126 


0.058 -0.033 


0.055 -0.036 


Xl2 


0.001 0.011 


0.002 0.035 





-0.000 0.000 





Xl 3 


-0.044 -0.920 


-0.040 -0.955 


-0.758 


0.014 -0.043 


0.017 -0.059 



The coefficients in Table 6 from CIS-SIR, CIS-PFC and SSIR are based 
on the original dataset, while the coefficients of C 3 is based on a data- 
specific weighted version [Zhou and He (2008)]. As suggested by CIS-PFC, 
explanatory variables xq, xj, xio, Xu, £12 and X13 would be important in 
explaining y. 

4.2. Bootstrap study. In Table 7, we used the bootstrap to assess the 
accuracy of variable selection for all methods except C 3 , as it is not clear 
how the weighting procedure used by Zhou and He should be automated. 
Without weighting we encountered serious convergence problems in the C 3 
algorithm. This bootstrap study can be considered as another simulation 
study. 

The bootstrap procedure was conducted as follows. First, we randomly 
chose with replacement 374 observations for y jointly with xq, xj, xiq, xu, 
x\2 and X13. Secondly, we separately randomly selected 374 observations for 
x\, X2, X3, X4, X5, xq and xg. Then we combine them to make one complete 
bootstrap dataset. In this way, we mimic the results of the analysis of original 
data, forcing x±, X2, £3, £4, £5, xg and £9 to be irrelevant. This procedure 



Table 7 

Variable selection in bootstrapping Boston housing data 



Method: 


CIS-SIR 


CIS-PFC 


SSIR-BIC 


SSIR-RIC 


n 


0.947 


0.962 


0.963 


0.877 


r-2 


0.969 


0.980 


0.780 


0.952 


T3 


0.550 


0.672 


0.118 


0.264 
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was repeated 2500 times. The resulting rates n, ri and r$ are shown in 
Table 7. The results show a pattern similar to those in simulation studies 
and again CISE performed quite well. 

5. Discussion. The establishment of the oracle property in this paper takes 
advantage of the simple trace form of the objective function: — tr(V T M„V) . 
However we believe that the proof in the Appendix can be extended to 
more general objective functions. Moreover, it is also of great interests to 
see whether CISE and its oracle property are still valid in high-dimensional 
settings in which p> n. 

We have seen that N n usually takes the form of the marginal sample co- 
variance matrix of x, while M n depends on the specific method. In practice, 
how to choose M ra for variable selection is an important issue and merits 
thorough investigation. In addition, it is well demonstrated that for the mul- 
tiple regression model, the BIC criterion tends to identify the true sparse 
model well if the true model is included in the candidate set [Wang, Li and 
Tsai (2007)]. The consistency of the BIC criterion proposed in Section 2.6 
deserves further study as well. 

APPENDIX 

Throughout this section, we will use the following notation for ease of 
exposition. Q(T;G n ,N n ) := -tr(r T G n r) + p(N^ 1/2 r) denotes the con- 
strained objective function in the minimization problem (2.9). Unless oth- 
erwise stated, we also use the generic notation Q(T) or Q(V) to represent 
the function Q(r;G n ,N n ) or C}(V;M n ) for abbreviation, which should not 
cause any confusion, lj denotes a row vector with one in the ith position 
and zero in the others. 

PROOF of Proposition 1. Cook (2007) has shown that the maximum 
likelihood estimator of span(A _1 r) in the general PFC model equals the 

— 1/2 

span of {ei, . . . , e^}, where ej = S n rj and r, is the ith eigenvector of 

— 1/2 ^ 1/2 

S n Sg t S n corresponding to the eigenvalue fcj. Consequently, we have 

It follows that M n = E flt and N n = E n . □ 

In order to prove the theorems, we first state a few necessary lemmas. For 
notation convenience, we need the following additional definitions. Define 
the Stiefel manifold St(p, d) as 

st(p, d) = {v e M pxd : r T r = i d }. 

Denotes |_rj as the subspace spanned by the columns of T, then |_rj € 
Gr(p, d) where Gr(p, d) stands for the Grassmann manifold. The projection 
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operator R : M. pxd — > St(p, d) onto the Stiefel manifold St(p, d) is defined to be 

R(T) = argmin ||r~ W|| 2 . 
west( P ,d) 

The tangent space Tr(p,d) of T S St(p, d) is defined by 
T T (p,d) = {ZGM pxd :Z = rA + r ± B, 

(A.l) 

A e R dxd , A + A T = 0, B e M (p " d)x,i }, 
where T ± £ ]RP x (P- d ) is the complement of T satisfies [IT J T [IT J = I p . 

Lemma 1. If Z eT r (p,d),T e St(p,d), we have: 

(i) For any symmetric matrix C £ M dxrf , tr(Z r rC) = 0. 

(ii) R(T + tZ) = T + tZ- (i/2)t 2 rz T z + o(t 3 ). 

This lemma comes from Lemma 10 and Proposition 12 of Manton (2002). 

Lemma 2. Under conditions in Theorem 1, we have 

D(T,T ) = O p (n~ 1 / 2 ), 

where Tq denotes any minimizer of (2.8) when G n is taken as the population 
matrix G. 

This lemma can be proved in a similar fashion to the proof of Theorem 1 
and hence omitted here. 

Proof of Theorem 1. Clearly, to prove this theorem is equivalent to 
show there exists a local minimizer T n of Q(T; G n , N n ) subject to T T T = 1^, 
so that 



D(t n ,T ) = O p (n 



Denote T* as an orthonormal basis matrix of the subspace spanned by the 
columns of N^Vo- Thus, there exists a positive-definite matrix O S M rfxci 
so that r* = Ny 2 VoO. By Assumption 2 and V^NVo = Id, we have 

T = l d + O p {n- 1 / 2 ). 

Note that To = N 1 / 2 Vo, and thus it is equivalent to show that 

D(t n ,T,) = O p (n- 1 / 2 ), 

since -D(r*,ro) = O p (n~ 1 / 2 ) and D(-,-) satisfies the triangle inequality. 

To ease demonstration, we need define the concept of the neighborhood 
of [r*J . For an arbitrary matrix W £ W xd and scaler 5sM, the perturbed 
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point around in Stiefel manifold can be expressed by R(T* + <5W). The 
perturbed point around |_r*J in Grassmann manifold can be expressed by 
L-R(r* + 5W)J . According to Lemma 8 of Manton (2002), W can be uniquely 
decomposed as 

w = r*A + r* ± B + r*c, 

where A G M. dxd is a skew-symmetric matrix, B G R(P _rf ) xrf is an arbitrary 
matrix, and C G M. dxd is a symmetric matrix. Let Z = T^A + r*^B. Obvi- 
ously, Z G TY,(p,d). Henceforth, Z refers to the projection of an arbitrary 
matrix W G M. pxd onto the tangent space Tr„(p, <i), unless otherwise stated. 
From Proposition 20 of Manton (2002), it is straightforward to see 

[R(T* + SW)\ = [R(T* + 5(T*A + r„ ± B + r*C))J 

= [RiT^Ia + <5(A + C)) + 5T, ± B)\ 

= lr*(I d + 5(A + C)) + 5T* ± B\ 

= [r, + <5r* ± B(I d + 5(A + C))" 1 ] 

= [R(T, + 5T, ± B')\, 

provided that 5 is sufficiently small so that Id + <5(A + C) is a full rank 
matrix, where B' = B(Lj + 5(A + C)) _1 . Since B G R(P~ d ^ xd is an arbitrary 
matrix and we do not need the specific form of B and B' in our proof, we 
only use B for notation convenience. This tells us that the movement from 
[r^J in the near neighborhood only depends on the T^B. In other words, 
it suffices to only consider perturbed points like R(T* + <5Z) in the following 
proofs, where ||B|| S = C for some given C. It is worth noting that though 
our problems essentially are Grassmann manifold optimization, we prove the 
theorem in a more general way, say in Stiefel manifold [using Z G 2r„ (p, d)] 
since the latter has simpler matrix expressions and thus is more notationally 
convenient. 

For any small e, if we can show that there exits a sufficiently large con- 
stant C, such that 

(A.2) limPrf inf Q(R{T* + n~ 1/2 Z)) > Q(T*)) > 1 - e, 

n VzeT r »(p,d): [|B|| S =C J 

then we can conclude that there exists a local minimizer Y n of Q(T) with 
arbitrarily large probabilities such that ||f n — r*|| s = O p (n~ 1 / 2 ). This cer- 
tainly implies that D(T n ,T*) = O p (n -1 / 2 ) by Definition 1. 
By using Lemma 1, for Z G Tr t (p, d) we have 

n{Q(R(T, + n~ 1/2 Z)) - Q(T*)} 

= [-tr(Z T G„Z)-2^tr(Z T G n r,)+tr(Z T Zr^G n r,)](l + o p (l)) 
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i=i 



liN-Vafr. 



n~ 1/2 Z- -n _1 r*Z T Z 



-^•Hlj-N-V^Ha (l + o p (l)) 
> [- tr(Z T G„Z) - 2^tr(Z T G„r,) + tr(Z T Zr^G n r,)] (1 + o p (l)) 

+ n XJ iyN- 1 / 2 ^r*+ri- 1/2 z-^n- 1 r*z T z^ 



l.-N-v^r 



* 2 



(1+^(1)) 



> [- tr(Z T G n Z) + tr(Z r Zr^G n r,) - 2v^tr(Z T G n r,)](l + o p (l)) 
x maxllllj-N-^r.Ha 1 • 111,-N" 1 / 2 ^ - (l/2)n- 1 / 2 r,Z T Z)|| 2 } 

3 

= (A 1 + A 2 )(l + 0p (l)), 

—1/2 

where the second inequality holds because ljN n T* = for any j > q 
by Assumption 1, and the last inequality comes from first-order Taylor ex- 
pansion and the definition of a n . In addition, according to the theorem's 
condition ^/na n A 0, we known that A 2 is o p (l). Furthermore, based on 
Lemma 1 and Assumption 2, we have 

V^tr(Z T G n I\) = V^tr(Z T Gr O) + ^r^G^N^N" 1 / 2 - G)r O) 
= V^tr(Z T r AiO) + v^tr(Z T (G„ - G)r O) 

+ V^tr(Z T G n r O) • Opin" 1 / 2 ) 
= v^tr(Z T (G n - G)r O) + O p (n~ 1 ' 2 ) 
= ^tr(A T r^(G n - G)r O) 

+ v^tr(B T r^(G n - G)r O) + Opin- 1 ' 2 ) 

= ^tr(B T r^(G n - G)r )(l + Opin- 1 ' 2 )), 

where A = diag{Ai, A 2 } is the diagonal eigenvalue matrix of G with the 
first d x d sub-matrix Ai. By using the definition of Z in (A.l), we get 

tr(Z T Zr^G„r,)-tr(Z T G n Z) =tr(Z T ZOr^Gr O)-tr(Z T GZ)+O p (?i- 1/2 ) 

= tr(Z T ZAi) - tr(Z T GZ) + O p (?i~ 1/2 ) 
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= tr(A T AAi) + tr(B r BAi) - tr(BB T A 2 ) 

-tr(AA r Ai) + o p (l) 

>(A d -A d+1 )||B||2, 

where we use the fact tr(A r AAi) — tr(AA T Ai) = because A is skew- 
symmetric. Here the last inequality follows from basic properties of trace 
operator for semi-positive definite matrix. As a consequence, by the Cauchy- 
Schwarz inequality for trace operator, the third term in Ai is uniformly 
bounded by ||B|| S x ||- v /n(G n — G)ro|| s . Therefore, as long as the constant 
C is sufficiently large, the first two terms in Ai will always dominate the 
third term and A2 with arbitrarily large probabilities. This implies inequal- 
ity (A. 2), and the proof is complete. □ 

Proof of Theorem 2. (i) To prove this part, we need represent (2.7) 
as vector forms. Define 

* = (*Tj • • • ) T ' 

fc,(t) = t T C { t, l = l,...,d, 

h kl (t)=t T C kl t, (k,l)€j, 

J = {(k,l)\k,l = l,...,d,k<l}, 

where tj denotes the ith column vector of V, C/'s are pd x pd block-diagonal 
matrices, C/^'s pd x pd block matrices, C/ and Cf-i contain N n in the ith 
diagonal block and in the (k, I) as well as (I, k) blocks, respectively. The 
pd x pd symmetric matrices C k i are defined for all the pairs of different 
indices belonging to J , given by the d(d— l)/2 combinations of the indices 
l,...,d. 

By this notation, we have 

v 

Q(T) := Q*(t) = -t T At + ^2 OilWih, 

i=i 

where A is a pd x pd block-diagonal matrix with all diagonal blocks M n . Of 
course, in the above equation each v, is regarded as a function of t. 

By using the equality representation of the compact Stiefel manifolds 
St(p, d), (2.7) is equivalent to 

nun-|t T At + 5^ei||Vi|| 2 | 

(A.3) 

subject to hi(t) = 1, 1 = 1 G [1, d] and h ki (t) = 0, (k, I) G J. 

As a consequence, this enables us to apply an improved global lagrange 
multiplier rule proposed by Rapcsak (1997). 
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We start by supposing that Vj ^ for all j. According to Theorem 15.2.1 
in Rapcsak (1997) [or Theorem 3.1 in Rapcsak (2002)], a necessary condition 
that t n (V n ) is a local minimum of (A. 3) [equation (2.7)] is that, the geodesic 
gradient vector of the improved Lagrangian function of (A. 3) evaluated at 
t n equals to zero. That is, 

&>Q*(t) _ \ dQ*(t) , ! dQ*(t) 

Ot { ] Ot 



dt 



(A.4) 



where 



t — tn 



t=t r 



d 9 f(V r 



Ot 



+ 



d g P (V n ) 



t — t T - 



Ot 



0, 



t — tfj 



U — (Cit, . . . , C^t, Ci2t, Ci3t, . . . , Cd-idt) 

is a (pdx [d(d+l)/2])-dimensional matrix, and 9 f(V n )/0t and 9 p(V n )/dt 
are defined in a similar form of d 9 Q*(t)/dt by replacing Q* with / and p, 
respectively. By Theorem 1 and noting that df(V n )/dt is linear in t, 



d 9 f(V n ) 



Ot 



d 9 f(V n 



t — t T - 



dt 



+ O p (n~ 1 / 2 ), 



t — t, n 



where t n is the vector form of V n . Using Theorem 3.1 of Rapcsak (2002), we 
have d 9 f(V n )/dt\ t=(n = 0, which yields that d 9 f (V n ) / dt\ t=in = O p {n~ 1 ' 2 ) 
and as a consequence 

d 9 p(V n )/dt\ t=in = O p (n~ 1 / 2 ). 

On the other hand, 
d 9 p(V n ) 



Ot 



[I^-UOJ'U^U^Hfl, 



t — 



where 







|v n i|| 2 '" 



otnlp 



Oltndl 



np II 2 



l^nllb'" 



np\\2j 



By using the fact that U has full column rank and HU = 0, we know 9 can be 
expressed through a linear combination of the columns of U in probability, 
that is, 

[|0|| 2 ~ 

6 = (kiCi H h K d C d + K12C12 + K13C13 H h ACd-idC d -id) ~ t w 



2 



where «i, . . . , Hd-id are a sequence of constants satisfy they are not all the 
zeros. Define a sequence of p<i-dimensional vectors Zj,-'s, 

C0 T t T T t T T ^ T 
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for j > i, say, its [{i — l)p + l]th to the [(i — l)p + p]th elements and [(j — 
l)p + l]th to the [(j — l)p + p]ih elements are both t n j. It is straightforward 
to see 

KoK^zp + Op^" 1 / 2 ), 

(A.5) 

K (Ki + Kij) = zfjO + O p (n 1/2 ) forj>i, 

where we denote kq = ||0||2/||t n ||2- By Theorem 1, v n j = O p {n~ 1 / 2 ) for j > q. 
Thus, by recalling the theorem's condition on a n and b n , it can be easily 
verified that (A. 5) leads to 

Ki + Kij = K^izfjO + O p {n- l l 2 )) 

< O p (b~ l ) ■ O p (a n + b n n^ 2 + n~ 1 /2 ) 

= o„(l). 

Similarly, Ki = o p (l). Consequently, we can conclude all the Ki and equal 
to zero in probability which yields contradiction. As a result, with probability 
tending to 1 (w.p.l), (A. 4) cannot hold, which implies there exists j > q so 
that 

Pr(v ni = 0) -)■ 1. 

Without loss of generality, we assume Pr(v np = 0) — > 1. Let M n i and N n i 
be the first (p — 1) x (p — 1) sub-matrices of M„ and N n , respectively, and 
V n i be the first p — 1 rows of V n . As stated before, V n is a local minimum 
of the objective function 

v 

Q(V;M n ) = -tr(V T M n V) +^^|| Vi || 2 subject to V r N n V = l d . 

i=l 

We will show that w.p.l V„i is also a local minimum of the objective function 

p-1 

Q(Vi;M n i) = - tr(Vi T M n iVi) + ^^11^112 

(A.6) 

subject to Vi N n iVi = 1^, 

w.p.l. Denote the set A\ = {Vi|||V"i - V n i|| s < 5; Vf N ni Vi = I d }. For any 
Ai £ Ai, denote A = (A^, T ) T . It is clear that A T N n A = Id- Given 5 small 
enough, we will have Q(A;M n ) > Q(V n ;M n ) since V n is the local min- 
imum. Note that Q(A;M n ) = Q(Ai;M n i) and Q(V;M n ) = Q(V nl ;M nl ) 
w.p.l. Consequently, we have 

Q(Ai;M nl )>Q(V nl ;M nl ) w.p.l, 
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for all Ai € A provided that 5 is sufficiently small. Hence, we can conclude 
that V n i is also a local minimum of the objective function Q(Vi;M„i) 
w.p.l. 

Rewriting (A. 6) as a similar form to (A. 3) and following the same ar- 
guments above in proving Pr(v np = 0) — > 1, we can show that there exists 
q < j < p so that Pr(v nj - = 0) — > 1. The remaining proofs can be completed 
by deduction. 

(ii) For convenience purposes, first decompose the matrix M n and N n 
into the following block form: 



Mr 



M 



M 



n(q) 



M 



12 



21 



M 



n{p-q) 



N 



N 



n(q) 



N 



12 



21 



N 



n{p-q) 



where M n ( ? ) and N n ( ? ) are the first q x q sub-matrices. It then follows that 



/(V; M n ) = - tr(Vf 9) M n(g) V w ) - tr^M^jV 



T 



(p-q) lyl n(p-q) X (p-q))- 



Theorem 1, since V 



n(q) 
n(p-q) 



Next we will show ~V n (q) = V n (o)(l + o p (n 1 ^ 2 )). Similar to the proof of 



n(O) 

w.p.l, it suffices to show, for any arbitrarily 



small e > 0, there exits a sufficiently large constant C, such that 



liminf Pr 



ZGTp 



i(O) 



} U L iit,,, Qi R ^n{0) + a nZ);G n ( (? ),N n((? )) 
(q,d) : ||B|| S =G 



it (0)i G n (q),N n ( ? )) 



> 1 - e, 



(A.7) 
where 

and G 

a n 2 {Q( R (^n(o) + a «Z); G n ( ? ),N n ( g )) - Q(r n ( ); G n ( ? ),N n ( g ))} 



\{0) = argmin-tr(T G n((?) r) 

rGR9 xd 

^OnwOote that 



subject to T 1 T = I d 



>[-tr(Z T G n(g) Z 



2a n 1 tr(Z T G n ( q )f n(0) ) + tr(Z T zf ^ (0) G n(g )f n(0) )] 



.Z T Z) 



x(l + o p (l)) 

- qWljN-ffiZ - (l/2)a n f n(0) ~ ~, ll2 , 
where 2a" 1 tr(Z T G n ( g )r n ( )) = by using Lemma 2, and 

- tr(Z T G n(9) Z) + tr(Z T zf T n{0) G n{q) f n(G) ) > 0. 

Using the similar arguments in the proof of Theorem 1, we can show (A.7) 
holds. This implies that ^/nT^^ is asymptotically equivalent to ^/nT^o) 
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where 

r„( g ) = argminQfT; G n ( g ),N n(g) ) subject to T T r = I d , 

1/2 ~ 1/2 — 

and thus it follows that 1 /nD(N n ^- ) V n ( 9 ), N n ^^V n (o)) = o p (l) which com- 
pletes the proof. □ 

Proof of Proposition 3. To illustrate the idea, we elaborate on ver- 
ifying the condition (2.10) for DR. In this case, by equation (5) in Li and 
Wang (2007), M n can be reexpressed as 

M n = 2{£y 2 £[Var(z|y) - Ip} 2 ^/ 2 

+ £y 2 %Var(z|y) - I p )E(z\y)E(z T \y)]Vl/ 2 

+ ^ 2 E[E(z\y)E(z T \y)(V^(z\y) - I p )]^/ 2 

+ ^ 2 E[E(z\y)E(z T \y)] 2 ^ 2 

+ ^ 2 E 2 [E(z\y)E(z T \m 1 n /2 

+ Sy 2 ^^(z r |y)^(z|y)]^[^(z|y) J g(z r |y)]Sy2 } 

:=2(M nl + --- + M n6 ). 

Here, y is the discretized y over a collection of slices, Var(z|y) denotes the 
sample covariance matrix of z within a slice, E(-) denotes the weighted 
average across slices. Next, we will show NL n ( )i = M n / g v + O p (n~ l ) for 
i = l,.. .,6. 

Now we first deal with M n i. Rewrite it as 

M n i = £{[Var(x|y) - S n ]5]- 1 [Var(x|y) - E„]}. 

We assume that the collection of slices is fixed; that is, it does not vary 
with n. This implies that the sample conditional moments such as Var(x|y) 
are -^/n-consistent estimates of their population- level counterparts, such as 
Var(x|y). Let fi be the matrix consisting of the first q columns of the matrix 
I p . Then, by definition, 

M n{0)1 = fi T £{[Var(x|y) - S n ]fi(n T S n n)- 1 fi T [Var(x|y) - E n ]}ft, 
M n(g)1 = fi T £{[Var(x|y) - E^E^Va^y) - S n ]}ft. 
Let P n (S n ) = fi(n T E n J2)- 1 n T S n and let Q n (S„) = I p - P n (S n ). Then 
M n(g)1 = n T E{[V^x( X \y) - E n ][P n (S n ) + Qn(S n )]S" 1 

x [P n (E n ) + Q n (E n )] T [Var(x|y) - S n ]}ft 
:=E(M U + Mm + M im + M 1IV ), 
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where 

Mn = fi T [Var(x|y) - S n ]P n (5] n )5]- 1 P^(5] n )[\^r(x|y) - E re ]«, 
Mm = r2 T [Var(x|y) - E n ]Q n (S n )5]- 1 P^(5] n )[Var(x|y) - E n ]fi, 
Mim = fF[Var(x|y) - E n ]P n (5] n )5]- 1 Q^(S n )[Var(x|y) - S n ]ft, 
Muv = fi T [Var(x|y) - S n ]Q n (S n )S- 1 Q^(E n )[\^r(x|y) - E n ]n. 

It can be easily seen that £(Mu) is exactly M„(q)i. We will show that 
Mm, Mmi and Muv are of the order O p (n _1 ). Note that 

Q£(E n ,)[Var(x|y)-S n ] 

= [Q£(E) + O^n" 1 / 3 )]^ (x|y) - S + p {n~ 1 ' 2 )} 

= Q^(E)[Var(x|y) - S] + Op^ 1 / 2 ). 
By construction, S y \ x C span(fi). Under certain conditions [Cook (1998a)], 
we know span{E _1 [E — Var(xjy)]} C <Sy| x . Hence, 

span{E _1 [E - Var(x|y)]} C span(fi). 
It then follows that 

(A.8) Q n (S)S- 1 [Var(x|y) — E] = S- 1 Q^(S)[Var(x|y) - E] = 0. 

Thus, we have Muv = O p {n~ 1 / 2 ) ■ Op(n" 1 / 2 ) = O p {n~ l ). 

Substituting Pn(E n ) = Ip — Qn(E n ) into Mm and using Qn(E n )'s idem- 
potency, we have 

Mm = ft T [Var(x|y) - S n ]Q n (E n )Q n (E n )E- 1 [^(x|y) - E n ]ft - M 1IV . 

By using (A.8) again, we know that Mm = O p (n~ l ). Similarly, Mmi = 
O p (n~ l ). From these, we deduce that Mm, Mmi, Muv are all of order 
O p (n _1 ). Since £(Mm + Mmi + Muv) is the sum of finite number of terms 
each of the order O p (n~ 1 ), it is itself of this order. It follows that M n (Q)i = 
M n(g) i + Op(n" 1 ). 

Next, let us deal with M n2 . Similar to M n ( ? )i, M n ( g ) 2 can be divided into 
four terms M n(g)2 = M n(0 ) 2 + M 2 n + M 2m + M 2 iv, where 

Man = ft T [Var(x|y) - S n ]Q n (S n )S- 1 

x P£(E n ){[£(x|y) - E(x)][E(x T \y) - £(x T )]}fi, 

M 3 ni = fi T [Var(x|y) - S n ]Pn(E n )E~ 1 

x Q£(S n ){[£(x|y) - £(x)][£(x T |y) - £(x T )]}f2, 

M 2 i V = ft T [Var(x|y) - S n ]Q n (S n )S- 1 

x Q^(E n ){[^(x|y) - E(x)][E(x T \y) - £(x T )]}ft. 
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Under the linearity condition, we know span{[£'(x|y) — E(pc)]} C S y \ x [Cook 
(1998a)]. Hence, 

span{[£(x|y) - E(x)]} C span(n). 

It then follows that 

(A.9) Q n (S)S^ 1 [ J E(x|y) - E(x)] = E^Q^E)^^) - E(x)] = 0. 

By using (A.9) and the similar arguments for M^gij, we can show that 
M211, M2111 and M21V are all of order O p {n~ l ). Thus, we can conclude that 
M n(0) 2 = M n(9)2 + O p (n- 1 ). 

By (A. 8) and (A.9), M. n ( )i = M n ( g )j + O p (n _1 ) for i = 3, . . . , 6, can be 
proved in a similar fashion to the foregoing proofs. We omit the details here 
for saving some space. It follows that for the DR method, 

M n(0) = M n(g) +O p (?i- 1 ). 

Thus, condition (2.10) is satisfied as long as (na n ) _1 = O p (l). 

Note that for SAVE, M n takes the form of M n i for DR. Thus, condi- 
tion (2.10) is also satisfied for SAVE. □ 
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